To appear as Chapter 5 of Quantitative Biology: From Molecular to Cellular Systems, ME Wall,
ed. (Taylor and Francis, 2011). In this Chapter, we ask two questions: (1) What is the right way to
measure the quality of information processing in a biological system? and (2) What can real-life
organisms do to improve their performance in information-processing tasks? We then review
the body of work that investigates these questions experimentally, computationally, and theoretically
in biological domains as diverse as cell biology, population biology, and computational neuroscience.



I. LIFE IS INFORMATION PROCESSING 

All living systems have evolved to perform certain tasks in specific contexts. There are
far fewer tasks than there are different biological solutions that nature has created. Some
of these problems are universal, while the solutions may be organism-specific. Thus a lot
can be understood about the structure of biological systems by focusing on what they do and
why they do it, in addition to how they do it on molecular or cellular scales. In particular,
this way we can uncover phenomena that generalize across different organisms, thus increasing
the value of experiments and building a coherent understanding of the underlying
physiological processes.

In this Chapter, we will take this point of view while analyzing what it takes to perform
one of the most common, universal functions of organisms at all levels of organization:
signal or information processing and shaping of a response (these are variously known in
different contexts as learning from observations, signal transduction, regulation, sensing,
adaptation, etc.). Studying these types of phenomena poses a series of well-defined,
physical questions: How can organisms deal with noise, whether extrinsic or generated by
intrinsic stochastic fluctuations within molecular components of information processing
devices? How long should the world be observed before a certain inference about it can be
made? How is the internal representation of the world made and stored over time? How can
organisms ensure that the information is processed fast enough for the formed response to
be relevant in the ever-changing world? How should the information processing strategies
change when the properties of the environment surrounding the organism change? In fact,
such "information processing" questions have been featured prominently in studies on all
scales of biological complexity, from learning phenomena in animal behavior [IHZ], to
analysis of neural computation in small and large animals [8-16], and to molecular
information processing circuits fTTHlS], to name just a few.

In what follows, we will not try to embrace the unembraceable, but will instead focus on
just a few questions, fundamental to the study of signal processing in biology: What is the
right way to measure the quality of information processing in a biological system? and What
can real-life organisms do in order to improve their performance in these tasks?



The study of biological information processing has grown dramatically in recent years, and
it is expanding at an ever-growing rate. There are now entire conferences devoted to the
related phenomena (perhaps the best example is The International q-bio Conference on
Cellular Information Processing, http://q-bio.org, held yearly in Santa Fe, NM, USA).
Hence, in this short chapter, we have neither the ability nor the desire to provide an
exhaustive literature review. Instead, the reader should keep in mind that the selection of
references cited here is a biased sample of important results in the literature, and I
apologize profusely to my friends and colleagues who find their deserving work omitted in
this overview.



II. QUANTIFYING BIOLOGICAL INFORMATION PROCESSING

In the most general context, a biological system can be modeled as an input-output device,
cf. Fig. 1, that observes a time-dependent state of the world s(t) (where s may be
intrinsically multidimensional, or even formally infinite dimensional), processes the
information, and initiates a response r(t) (which can also be very high dimensional). In
some cases, in its turn, the response changes the state of the world and hence influences
the future values of s(t), making the whole analysis so much harder [2S]. In view of this,
analyzing the information processing means quantifying certain aspects of the mapping
$s(t) \to r(t)$. In this section, we will discuss the properties that this quantification
should possess, and we will introduce the quantities that satisfy them.

The main questions addressed in this review:

• What is the right way to measure the quality of information processing in a biological
system?

• What can real-life organisms do in order to improve their performance in these tasks?

A. What is needed? 

One typically tries to model molecular or other physiological mechanisms of the response
generation. For example, in well-mixed biochemical kinetics approaches, where s(t) may be a
ligand concentration, and r(t) may be an expression level of a certain protein, we often
write

$$\frac{dr}{dt} = F_a(r,s,h) - G_a(r,s,h) + \eta_a(r,s,h,t), \qquad (1)$$

where the nonnegative functions $F_a$ and $G_a$ stand for the production and degradation of
the response, influenced by the level of the signal s, and $\eta_a$ is a random forcing due
to the intrinsic stochasticity of chemical kinetics at small molecular copy numbers [27].
The subscript a stands for the values of adjustable parameters that define the response
(such as various kinetic rates, concentrations of intermediate enzymes, etc.), which
themselves can change, but on time scales much slower than the dynamics of s and r. In
addition, h stands for the activity of other, hidden cellular state variables, which change
according to their own dynamics, similar to Eq. (1). This dynamics can be written for many
diverse biological information processing systems, including the neural dynamics, where r
would stand for the firing rate of a neuron induced by the stimulus [25].

Importantly, because of the intrinsic stochasticity in Eq. (1), and because of the
effective randomness introduced by the state of the hidden variables, the mapping between
the stimulus and the response is non-deterministic, and it is summarized in the probability
distribution $P[\{r(t)\} \mid \{s(t)\}, \{h(t)\}, a]$, or, marginalizing over h,
$P[\{r(t)\} \mid \{s(t)\}, a] = P_a[\{r(t)\} \mid \{s(t)\}]$. In addition, s(t) itself is
not deterministic either: other agents, chaotic dynamics, statistical physics effects, and,
at a more microscopic level, even quantum mechanics conspire to ensure that s(t) can only
be specified probabilistically. Therefore, a simple mapping $s \to r$ is replaced by a
joint probability distribution (note that we will drop the index a in the future where it
doesn't cause ambiguities)

$$P[\{r(t)\} \mid \{s(t)\}, a]\, P[\{s(t)\}] = P[\{r(t)\}, \{s(t)\} \mid a] = P_a[\{r(t)\}, \{s(t)\}]. \qquad (2)$$

Hence the measure of the quality of the biological information processing must be a
functional of this joint distribution.

FIG. 1: Biological information processing and interactions with the world. In this review
we leave aside the feedback action between the organism's internal state and the state of
the world and focus on the signal processing and the adaptation arrows.

Biological information processing is almost always probabilistic.



Now consider, for example, a classical system studied in cellular information processing:
the E. coli chemotaxis (see Chapter 15 in this book) [25]. This bacterium is capable of
swimming up gradients of various nutrients. In this case, the signal s(t) is the
concentration of such extracellular nutrients. The response of the system is the activity
levels of various internal proteins, like cheY, cheA, cheB, cheR, etc., which combine to
modulate the cellular motion through the environment. It is possible to write the chemical
kinetics equations that relate the stimulus to the response accurately enough and
eventually produce the sought-after conditional probability distribution
$P_a[\{r(t)\} \mid \{s(t)\}]$. However, are the ligand concentrations the variables that
the cell "cares" about? In this system, it is reasonable to assume that all protein
expression states that result in the same intake of the catabolite are functionally
equivalent. That is, the goal of the information processing performed by the cell likely is
not to serve as a passive transducer of the signal into the response [2H [25], but to make
an active computation that extracts only the part of the signal that is relevant to making
behavioral decisions. We will denote such relevant aspects of the world as e(t). For
example, for the chemotactic bacterium, e(t) can be the maximum nutrient intake realizable
for a particular spatiotemporal pattern of the nutrient concentration.

In general, e is not a subset of s, or vice versa, and instead the relation between s and e
is also probabilistic, $P[\{e(t)\} \mid \{s(t)\}]$, and hence the relevant variable, the
signal, and the response form a Markov chain:

$$P[\{e(t)\}, \{s(t)\}, \{r(t)\}] = P[\{e(t)\}]\, P[\{s(t)\} \mid \{e(t)\}]\, P[\{r(t)\} \mid \{s(t)\}]. \qquad (3)$$

The quantity we are seeking to characterize the biological information processing must
respect this aspect of the problem. Therefore, its value must depend explicitly on the
choice of the relevance variable: a computation resulting in the same response will be
either "good" or "bad" depending on what this response is used for. In other words, one
needs to know what the problem is before saying if a solution is good or bad.

It is impossible to quantify the information processing without specifying the purpose of
the device; that is, the relevant quantity that it is supposed to compute.



B. Introducing the right quantities 

The question of how much can be inferred about a state of a variable X from measuring a
variable Y was answered by Claude Shannon over sixty years ago [30]. Starting with basic,
uncontroversial axioms that a measure of information must obey, he derived that the
uncertainty in a state of a variable is given by

$$S[X] = -\sum_x P(x) \log P(x) = -\langle \log P(x) \rangle_{P(x)}, \qquad (4)$$

which we know now as the Boltzmann-Shannon entropy. Here $\langle \cdots \rangle_P$ denotes
averaging over the probability distribution P. When the logarithm in Eq. (4) is binary
(which we always assume in this Chapter), then the unit of entropy is a bit: one bit of
uncertainty about a variable means that the latter can be in one of two states with
equal probabilities.
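
As a minimal numerical sketch of Eq. (4), the snippet below evaluates the entropy of a few discrete distributions in bits. The function name `entropy_bits` and the example distributions are illustrative choices, not part of the original text.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with p = 0 are skipped, since p*log(p) -> 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # two equally likely states
biased_coin = entropy_bits([0.9, 0.1])    # less uncertain than a fair coin
certain = entropy_bits([1.0])             # a non-random variable
```

The fair coin carries exactly one bit of uncertainty, the biased coin carries less, and the non-random variable carries none.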

Observing the variable Y (a.k.a. conditioning on it) changes the probability distribution
of X, $P(x) \to P(x|y)$, and the difference between the entropy of X prior to the
measurement and the average conditional entropy tells how informative Y is about X:

$$I[X;Y] = S[X] - \langle S[X|Y] \rangle_{P(y)} \qquad (5)$$

$$= -\langle \log P(x) \rangle_{P(x)} + \left\langle \langle \log P(x|y) \rangle_{P(x|y)} \right\rangle_{P(y)} \qquad (6)$$

$$= \left\langle \log \frac{P(x,y)}{P(x)P(y)} \right\rangle_{P(x,y)}. \qquad (7)$$

The quantity I[X; Y] in Eqs. (5)-(7) is known as mutual information. Like entropy, it is
measured in bits. Mutual information of one bit means that specifying the variable Y
provides us with the knowledge to answer one yes/no question about X.
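
The definition of mutual information as an average of $\log [P(x,y)/(P(x)P(y))]$ over the joint distribution can be evaluated directly for a discrete joint probability table. The sketch below assumes NumPy is available; the function name and the three example tables are hypothetical illustrations.

```python
import numpy as np

def mutual_information_bits(joint):
    """I[X;Y] in bits from a joint probability table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                # normalize defensively
    px = joint.sum(axis=1, keepdims=True)      # marginal P(x), column vector
    py = joint.sum(axis=0, keepdims=True)      # marginal P(y), row vector
    mask = joint > 0                           # empty cells contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

# Y is a perfect copy of a fair binary X: Y answers one yes/no question about X.
copy = mutual_information_bits([[0.5, 0.0], [0.0, 0.5]])
# X and Y are independent: Y says nothing about X.
independent = mutual_information_bits([[0.25, 0.25], [0.25, 0.25]])
# Y is a copy of X that is flipped 10% of the time.
noisy = mutual_information_bits([[0.45, 0.05], [0.05, 0.45]])
```

The noisy copy falls between the two extremes, at one bit minus the entropy of the 10% flip.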

Entropy and information are additive quantities. That is, when considering entropic
quantities for time series data defined by $P[\{x(t)\}]$, for $0 < t < T$, the entropy of
the entire series will diverge linearly with T. Therefore, it makes sense to define entropy
and information rates [30]

$$\mathcal{S}[X] = \lim_{T \to \infty} \frac{S[\{x(0 < t < T)\}]}{T}, \qquad (8)$$

$$\mathcal{I}[X;Y] = \lim_{T \to \infty} \frac{I[\{x(0 < t < T)\}; \{y(0 < t < T)\}]}{T}, \qquad (9)$$

which measure the amount of uncertainty in the signal and the reduction of this uncertainty
by the response per unit time.



Entropy and mutual information possess some simple, important properties [31]:

1. Both quantities are non-negative, $0 \le S[X]$ and
$0 \le I[X;Y] \le \min(S[X], S[Y])$.

2. Entropy is zero if and only if (iff) the studied variable is not random. Further, mutual
information is zero iff $P(x,y) = P(x)P(y)$, that is, there are no statistical dependences
of any kind between the variables.

3. Mutual information is symmetric, $I[X;Y] = I[Y;X]$.

4. Mutual information is well defined for continuous variables; one only needs to replace
sums by integrals in Eq. (7). In contrast, entropy formally diverges for continuous
variables (any truly continuous variable requires infinitely many bits to be specified to
an arbitrary accuracy), but many properties of entropy are also exhibited by the
differential entropy,

$$S[X] = -\int_x dx\, P(x) \log P(x), \qquad (10)$$

which measures the entropy of a continuous distribution relative to the uniformly
distributed one. In this Chapter, S[X] will always mean the differential entropy if x is
continuous and the original entropy otherwise.

5. For a Gaussian distribution with a variance of $\sigma^2$,

$$S = \tfrac{1}{2} \log \sigma^2 + \mathrm{const}, \qquad (11)$$

and, for a bivariate Gaussian with a correlation coefficient of $\rho$,

$$I[X;Y] = -\tfrac{1}{2} \log (1 - \rho^2). \qquad (12)$$

Thus entropy and mutual information can be viewed as generalizations of more familiar
notions of variance and covariance.

6. Unlike entropy, mutual information is invariant under reparameterization of variables.
That is

$$I[X;Y] = I[X';Y'] \qquad (13)$$

for all invertible $x' = x'(x)$, $y' = y'(y)$. That is, I provides a measure of statistical
dependence between X and Y that is independent of our subjective choice of the measurement
device [TUT].
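
Properties 4 and 5 can be checked against each other numerically. The sketch below, which assumes NumPy, discretizes a unit-variance bivariate Gaussian on a fine grid, computes the mutual information of the resulting table, and compares it with the closed form of Eq. (12). The grid extent, the grid step, and the function name are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_mi_on_grid(rho, half_width=6.0, step=0.05):
    """Mutual information (bits) of a bivariate Gaussian, evaluated on a grid."""
    grid = np.arange(-half_width, half_width + step / 2, step)
    x, y = np.meshgrid(grid, grid, indexing="ij")
    # Unnormalized density of a zero-mean, unit-variance bivariate Gaussian.
    density = np.exp(-(x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2)))
    joint = density / density.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

rho = 0.8
numeric = gaussian_mi_on_grid(rho)
analytic = -0.5 * np.log2(1 - rho**2)   # Eq. (12) with a binary logarithm
uncorrelated = gaussian_mi_on_grid(0.0)
```

The grid estimate stays finite as the grid is refined, in line with property 4, whereas the entropy of the same discretized variable would keep growing with the number of bins.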



C. When the relevant variable is unknown: The value of information about the world

One of the most fascinating properties of mutual information is the Data Processing
Inequality [31]. Suppose three variables X, Y, and Z form a Markov chain,
$P(x,y,z) = P(x)P(y|x)P(z|y)$. In other words, Z is a probabilistic transformation of Y,
which, in turn, is a probabilistic transformation of X. Then it can be proven that

$$I[X;Z] \le \min(I[X;Y], I[Y;Z]). \qquad (14)$$

That is, you cannot get new information about the original variable by further transforming
the measured data; any such transformation cannot increase the information.
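
The inequality in Eq. (14) is easy to probe numerically. The sketch below, assuming NumPy, draws a random four-state Markov chain X → Y → Z and compares the three mutual informations. The number of states, the random seed, and all variable names are illustrative assumptions.

```python
import numpy as np

def mutual_information_bits(joint):
    """Mutual information in bits from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(0)
n = 4
p_x = rng.dirichlet(np.ones(n))                   # P(x)
p_y_given_x = rng.dirichlet(np.ones(n), size=n)   # row x holds P(y|x)
p_z_given_y = rng.dirichlet(np.ones(n), size=n)   # row y holds P(z|y)

joint_xy = p_x[:, None] * p_y_given_x             # P(x, y)
p_y = joint_xy.sum(axis=0)                        # P(y)
joint_yz = p_y[:, None] * p_z_given_y             # P(y, z)
joint_xz = joint_xy @ p_z_given_y                 # P(x, z) = sum_y P(x, y) P(z|y)

i_xy = mutual_information_bits(joint_xy)
i_yz = mutual_information_bits(joint_yz)
i_xz = mutual_information_bits(joint_xz)
```

For any choice of the three random tables, `i_xz` should not exceed the smaller of `i_xy` and `i_yz`.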

Together with the fact that mutual information is zero iff the variables are completely
statistically independent, the Data Processing Inequality suggests that if the variable of
interest that the organism cares about is unknown to the experimenter, then analyzing the
mutual information between the entire input stimulus (sans noise) and the response may
serve as a good proxy. Indeed, due to the Data Processing Inequality, if I[S; R] is small,
then I[E; R] is also small for any mapping $S \to E$ of the signal into the relevant
variable, whether deterministic, e = e(s), or probabilistic, P(e|s). In many cases, such as
[TBI [2TJ [35], this allows us to stop guessing which calculation the organism is trying to
perform and to put an upper bound on the efficiency of the information transduction,
whatever an organism cares about. However, as was recently shown in the case of chemotaxis
in E. coli, when e and s are substantially different (resource consumption rate vs.
instantaneous surrounding resource concentration), maximizing I[S; R] is not necessarily
what organisms do [25].

Information about the outside world is the upper bound on information about any of its
features.



Another reason to study the information about the outside world comes from the old argument
that relates information and game theory [33]. Namely, consider a zero-sum probabilistic
betting game (think of a roulette without the zeros, where the red and the black are two
equally likely outcomes, and betting on the right outcome doubles one's investment, while
betting on the wrong one leads to a loss of the bet). Then the logarithmic growth rate of
one's capital is limited from above by the mutual information between the outcome of the
game and the betting strategy. This was recently recast in the context of population
dynamics in fluctuating environments [34-36]. Suppose the environment surrounding a
population of genetically identical organisms fluctuates randomly with no temporal
correlations among multiple states with probabilities P(s). Each organism, independently of
the rest, may choose among a variety of phenotypical decisions d, and the log-growth rate
depends on the pairing of s and d. Evolution is supposed to maximize this rate, averaged
over long times. However, the current state of the environment is not directly known, and
the organisms may need to respond probabilistically. While the short-term gain would
suggest choosing the response that has the highest growth rate for the most probable
environment, the longer term strategy would require bet-hedging [37], with different
individuals making different decisions.

Suppose an individual now observes the environment and gets an imperfect internal
representation of it, r, with the conditional probability of P(r|s). What is the value of
this information? Under very general conditions, this information about the environment can
improve the log-growth rate by as much as I[S; R] [33]. In more general scenarios, the
maximum log-growth advantage over uninformed peers needs to be discounted by the cost of
obtaining the information, by the delay in getting it [35], and, more trivially, by the
ability of the organism to utilize it. Therefore, while these brief arguments are far from
painting a complete picture of the relation between information and natural selection, it
is already clear that maximization of the information between the surrounding world and the
internal response to it is not an academic exercise, but is directly related to fitness and
will be selected for by evolution.

Information about the outside world puts an upper bound on the fitness advantage of an
individual over uninformed peers.
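
The link between side information and growth rate can be illustrated with the two-outcome betting game described above. The sketch below assumes NumPy and a bettor who wagers all capital each round in proportion to the believed probability of each outcome (proportional, or Kelly, betting). The 80% reliable cue, the variable names, and the helper `log_growth_rate_bits` are hypothetical choices made for this illustration.

```python
import numpy as np

def log_growth_rate_bits(joint, bets, payout):
    """Mean log2 capital growth per round: sum over (s, r) of P(s, r) log2(payout * b(s|r))."""
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(payout * bets[mask])))

# joint[s, r]: two equally likely outcomes s, seen through a cue r that is right 80% of the time.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
payout = 2.0                                   # a winning bet doubles the stake

p_s = joint.sum(axis=1, keepdims=True)         # P(s)
p_r = joint.sum(axis=0, keepdims=True)         # P(r)

uninformed_bets = np.broadcast_to(p_s, joint.shape)   # b(s|r) = P(s): the cue is ignored
informed_bets = joint / p_r                            # b(s|r) = P(s|r): the cue is used

w_uninformed = log_growth_rate_bits(joint, uninformed_bets, payout)
w_informed = log_growth_rate_bits(joint, informed_bets, payout)

mask = joint > 0
mutual_info = float(np.sum(joint[mask] * np.log2(joint[mask] / (p_s @ p_r)[mask])))
```

In this idealized setting the informed bettor's growth rate exceeds the uninformed one by exactly I[S; R]; costs and delays of acquiring the cue, which the text mentions, are not modeled.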



It is now well known that probabilistic bet hedging is the strategy taken by bacteria for
survival in the presence of antibiotics j35J [39] and for genetic recombination [40]. In
both cases, cell division (and hence population growth) must be stopped either to avoid DNA
damage by antibiotics, or to incorporate newly acquired DNA into the chromosome. Still, a
small fraction of the cells choose not to divide even in the absence of antibiotics to reap
the much larger benefits if the environment turns sour (these are called the persistent and
the DNA uptake competent bacteria for the two cases, respectively). However, it remains to
be seen in an experiment if real bacteria can reach the maximum growth advantage allowed by
the information-theoretic considerations. Another interesting possibility is that cancer
stem cells and mature cancer cells also are two probabilistic states chosen to hedge bets
against interventions of immune systems, drugs, and other surveillance mechanisms [43].



D. Time dependent signals: Information and Prediction

In many situations, such as persistence in the face of antibiotics treatment mentioned
above, an organism settles into a certain response much faster than the environment has a
chance to change again. In these cases, it is sufficient to consider the same-time mutual
information between the signals and the responses, as in [21], I[s(t); r(t)] = I[S; R],
which is what we have been doing up to now.

More generally, formation of any response takes time, which may be comparable to time
scales of changes of the stimuli. What are the relevant quantities to characterize
biological information processing in such situations? Traditionally, one either considers
delayed information,

$$I_\tau[S;R] = I[s(t); r(t+\tau)], \qquad (15)$$

where $\tau$ may be chosen as $\tau = \arg\max_{t'} I[s(t); r(t+t')]$, or studies
information rates, as in Eq. (9). The first choice measures the information between the
stimulus and the response most constrained by it; typically this would be the response
formed a certain characteristic signal transduction time after the stimulus occurrence. The
second approach looks at correlations between all possible pairs of stimuli and responses
separated by different delays.
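
For jointly Gaussian signals, the delayed information of Eq. (15) follows from Eq. (12) once the correlation coefficient between s(t) and r(t + τ) is known. The sketch below works through a toy case that is purely an assumption of this example: a stationary signal with autocorrelation $a^{|k|}$ at lag k (a discrete-time AR(1) process), read out as a delayed copy with additive independent Gaussian noise, r(t) = s(t - delay) + noise.

```python
import math

def delayed_information_bits(a, delay, tau, noise_to_signal):
    """I[s(t); r(t+tau)] in bits for r(t) = s(t - delay) + noise, with s a Gaussian AR(1) signal.

    a: lag-one autocorrelation of the signal (0 < a < 1).
    noise_to_signal: noise variance divided by signal variance (must be > 0).
    """
    # Correlation between s(t) and r(t+tau) for this toy model.
    rho = a ** abs(tau - delay) / math.sqrt(1.0 + noise_to_signal)
    return -0.5 * math.log2(1.0 - rho**2)      # Eq. (12) in bits

a, delay, noise_to_signal = 0.9, 3, 0.5
info_by_tau = {tau: delayed_information_bits(a, delay, tau, noise_to_signal)
               for tau in range(0, 11)}
best_tau = max(info_by_tau, key=info_by_tau.get)   # the argmax over delays in Eq. (15)
```

In this toy model the maximizing delay coincides with the transduction delay built into the response, and the information at other delays decays with the signal's correlation time.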

While there are plenty of examples of biological systems where one or the other of these
quantities is optimized, both of these approaches are insufficient. $I_\tau$ does not
represent all of the gathered information since bits at different moments of time are not
independent of each other. Further, it does not take into account that temporal
correlations in the stimulus allow it to be predicted, and hence the response may be formed
even before the stimulus occurs. On the other hand, the information rate does not
distinguish among predicting the signal, knowing it soon after it happens, or having to
wait for $T \to \infty$ in order to be able to estimate it from the response.

To avoid these pitfalls, one can consider the information available to an organism that is
relevant for specifying not all of the stimulus, but only of its future. Namely, we can
define the predictive information about the stimulus available from observation of a
response to it of a duration T,

$$I_{\rm pred}[R(T); S] = I[\{r(-T < t < 0)\}; \{s(t > 0)\}]. \qquad (16)$$

This definition is a generalization of the one used in [45], which had r(t) = s(t), and
hence calculated the upper bound on $I_{\rm pred}$ over all possible sensory schemes
$P[\{r(t)\} \mid \{s(t)\}]$.

All of the $I_{\rm pred}$ bits are available to be used instantaneously, and there is no
specific delay $\tau$ chosen a priori and semi-arbitrarily. The predictive information is
nonzero only to the extent that the signal is temporally correlated, and hence the response
to its past values can say something about its future. Thus focusing on predictability may
resolve a traditional criticism of information theory that bits don't have an intrinsic
meaning and value, and some are more useful than the others: since any action takes time,
only those bits have value that can be used to predict the stimulus at the time of action,
that is, in the future [45, 46].

Predictive information allows one to assign an objective value to information: only those
bits are useful that can be used to guide future responses.



The notion of predictive information is conceptually appealing, and there is clear
experimental and computational evidence that behavior of biological systems, from bacteria
to mammals, is consistent with attempting to make better predictions (see, for example,
[H)l |4"TH5"2] for just some results). However, even almost ten years after $I_{\rm pred}$
was first introduced, it still remains to be seen experimentally if optimizing predictive
information is one of the objectives of biological systems, and whether population growth
rates in temporally correlated environments can be related to the amount of information
available to predict them. Some of the reasons for the relative lack of progress may be
practical: estimating information among nonlinearly related multidimensional variables
[TH [531 1M] or extracting the predictive aspects of the information [55] from empirical
data is hard, while for simple Gaussian signals and responses with finite correlation
times, optimization of predictive information reduces to a much more prosaic matching of
Wiener extrapolation filters [56].



III. IMPROVING INFORMATION-PROCESSING PERFORMANCE

Understanding the importance of information about 
the outside world and knowing which quantities can be 
used to measure it, we are faced with the next question: 
How can the available information be increased in view of 
the limitations imposed by the physics of the signal and of 
the processing device, such as stochasticity of molecular 
numbers and arrival times, or energy constraints? 



A. Strategies for Improving The Performance 

We start with three main theorems of information theory due to Shannon [30]. In the source
coding theorem, he proved that to record a signal without losses, one needs only
$\mathcal{S}$, the signal entropy rate, bits per unit time. In the channel coding theorem,
he showed that the maximum rate of errorless transmission of information through a channel
specified by $P[\{r(t)\} \mid \{s(t)\}]$ is given by
$C = \max_{P[\{s(t)\}]} \mathcal{I}[R;S]$, which is called the channel capacity. Finally,
the rate distortion theorem calculates the minimum size of the message that must be sent
error-free in order to recover the signal with an appropriate mean level of some
pre-specified distortion measure. None of these theorems considers the time delay before
the message can be decoded, and typically one would need to wait for very long times and
accumulate long message sequences to reach the bounds predicted by the theorems since, for
example, responses a long time away from a certain signal may still carry some information
about it.
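
For a discrete channel used once per symbol, with no memory, the maximization over input distributions that defines the capacity can be carried out numerically. The sketch below uses the Blahut-Arimoto iteration, a standard algorithm that is swapped in here for illustration and is not discussed in the text; it assumes NumPy, and the function names, tolerance, and example channel are hypothetical.

```python
import numpy as np

def _kl_rows_nats(channel, p_r):
    """D[s] = sum_r P(r|s) ln(P(r|s) / P(r)); cells with P(r|s) = 0 contribute zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(channel > 0, np.log(channel / p_r), 0.0)
    return np.sum(channel * log_ratio, axis=1)

def channel_capacity_bits(channel, tol=1e-12, max_iter=10_000):
    """Capacity (bits per use) and optimal input distribution for channel[s, r] = P(r|s)."""
    channel = np.asarray(channel, dtype=float)
    p_s = np.full(channel.shape[0], 1.0 / channel.shape[0])   # start from a uniform input
    for _ in range(max_iter):
        d = _kl_rows_nats(channel, p_s @ channel)
        new_p_s = p_s * np.exp(d)          # Blahut-Arimoto update of the input distribution
        new_p_s /= new_p_s.sum()
        converged = np.max(np.abs(new_p_s - p_s)) < tol
        p_s = new_p_s
        if converged:
            break
    d = _kl_rows_nats(channel, p_s @ channel)
    return float(p_s @ d / np.log(2)), p_s   # mutual information at the optimized input

# Binary channel that flips each symbol with probability 0.1.
flip_channel = [[0.9, 0.1],
                [0.1, 0.9]]
capacity, optimal_input = channel_capacity_bits(flip_channel)
```

For this symmetric channel the optimal input is uniform. For asymmetric channels the optimized input generally differs from uniform, which is one way of seeing why matching the signal statistics to the channel matters.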

Leaving aside the complication of dynamics, which one may hope to solve some day using the
predictive information ideas, these theorems tell us exactly what an organism can do to
optimize the amount of information it has about the outside world. First, one needs to
compress the measured signal, removing as many redundancies as possible. There is evidence
that this happens in a variety of signaling systems, starting with the classical Refs.
[47, 57-59]. Second, one needs to encode the signal in a way that allows the transmitted
information to approach the channel capacity limit by remapping different values of signal
into an intermediate signaling variable whose states are easier to transmit without an
error. Again, there are indications that this happens in living systems [HI |fjfjH6"4"].
Finally, one may choose to focus only on important aspects of the signal, such as
communicating changes in the signal, or thresholding its value HSH2SH5SESI.

If the references in the previous paragraph look somewhat thin, it is by choice since none
of these approaches is unique to biology, and, in fact, most artificial communication
systems use them: a cell phone filters out audio frequencies that are irrelevant to human
speech, compresses the data, and then encodes it for sending with the smallest possible
errors. A lot of engineering literature discusses these questions [31], and we will not
discuss them further here. What makes biological systems unique is an ability to improve
the information transmission by modifying their own properties in the course of their life.
This adjusts the a in $P_a[\{r(t)\} \mid \{s(t)\}]$, and hence modifies the conditional
probability distribution itself. This would be equivalent to a cell phone being able to
change its physical characteristics on the fly. Unfortunately, as the recent issues with
the iPhone antenna have shown, human engineered systems are no match for biology in this
regard: they are largely incapable of adjusting their own design if the original turns out
to be flawed.



Unlike most artificial systems, living organisms 
can change their own properties to optimize 
their information processing. 



The property of changing one's own characteristics in response to the observed properties
of the world is called adaptation, and the remainder of this section will be devoted to its
overview. In principle, we make no distinction whether this adaptation is achieved by
natural selection or by physiological processes that act on much faster time scales
(comparable to the typical signal dynamics), and sometimes the latter may be as powerful as
the former [21, 67]. Further, we note that adaptation of the response probability
distribution and formation of the response itself are, in principle, a single process of
formation of the response on multiple time scales. Our ability to separate it into a fast
response and a slow adaptation (and hence much of the discussion below) depends on
existence of two well-separated time scales in the signal and in the mechanism of the
response formation. While such clear separation is possible in some cases, it is harder in
others, and especially when the time scales of the signal and the fast response may be
changing themselves. Cases without a clear separation of scales raise a variety of
interesting questions, but we will leave them aside for this discussion.

FIG. 2: Parameters characterizing response to a signal. Left panel: the probability
distribution of the signal, P(s) (blue), and the best-matched steady state dose-response
curve $r_{ss}$ (green). Top right: mismatched response midpoint. Bottom right: mismatched
response gain.



B. Three Kinds of Adaptation in Information Processing

We often can linearize the dynamics, Eq. (1), to get the following equation describing
formation of small responses

$$\frac{dr}{dt} = f[s(t)] - kr + \eta(t,r,s). \qquad (17)$$

Here r may be an expression of an mRNA following activation by a transcription factor s, or
the firing rate of a neuron following stimulation. In the above expression, f is the
response activation function, which depends on the current value of the signal; k is the
rate of the first-order relaxation or degradation; and $\eta$ is some stochastic process
representing the intrinsic noisiness of the system. In this case, r(t) depends on the
entire history of $\{s(t')\}$, $t' < t$, and hence carries some information about it as
well.
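
The linearized dynamics can be explored with a simple Euler-Maruyama integration. The sketch below assumes NumPy and makes several choices that are not in the text: the noise is additive white Gaussian noise of constant amplitude, the activation function is a Hill function, and all names and parameter values are hypothetical.

```python
import numpy as np

def simulate_response(signal, f, k, noise_std, dt, r0=0.0, seed=0):
    """Euler-Maruyama integration of dr/dt = f(s) - k*r + noise for a sampled signal."""
    rng = np.random.default_rng(seed)
    r = np.empty(len(signal) + 1)
    r[0] = r0
    for i, s in enumerate(signal):
        drift = f(s) - k * r[i]
        # White noise enters with a sqrt(dt) scaling in the discretized equation.
        r[i + 1] = r[i] + drift * dt + noise_std * np.sqrt(dt) * rng.standard_normal()
    return r

def hill(s, f_max=10.0, s_half=1.0, n=2):
    """Assumed sigmoidal activation function with midpoint s_half."""
    return f_max * s**n / (s_half**n + s**n)

dt, k = 0.01, 1.0
signal = np.full(2000, 1.0)                 # constant signal held at the midpoint
noiseless = simulate_response(signal, hill, k, noise_std=0.0, dt=dt)
noisy = simulate_response(signal, hill, k, noise_std=0.5, dt=dt)
fixed_point = hill(1.0) / k                 # where f(s) - k*r vanishes
```

For a constant signal the noiseless trajectory relaxes to f(s)/k on the time scale 1/k, while the noisy trajectory fluctuates around that value.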

For quasi-stationary signals (that is, the correlation 
time of the signal, t s 3> 1/fc), we can write the steady 
state dose-response (or firing rate, or . . . ) curve 

r ss = / [s{t)\ jk, (18) 

and this will be smeared by the noise r\. A typical mono- 
tonic sigmoidal / is characterized by only a few large- 
scale parameters: the range, fm'm and ,/maxi the argu- 
ment S1/2 at the mid-point value (/ m i n + / m ax)/2; and 
the width of the transition region, As (see Fig. [2]). If 
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the mean signal /i = (s(t)) t 3> Si/2j then, for most sig- 
nals, r ss ss fmnx/k and responses to two typical different 
signals si and s 2 are indistinguishable as long as 



dr ss (s) 



ds 



(s 2 - Si) < 



(19) 



where $\sigma_\eta/k$ is the precision of the response resolution
expressed through the standard deviation of the noise. A similar
situation arises when $\mu \ll s_{1/2}$ and $r_{ss} \approx f_{\min}/k$.
Thus, to reliably communicate information about the signal, $f$
should be tuned such that $s_{1/2} \approx \mu$. If a real
biological system can perform this adjustment, we call
this adaptation to the mean of the signal, desensitization,
or adaptation of the first kind. If $s_{1/2}(\mu) = \mu$, then the
adaptation is perfect. This kind of adaptation has been
observed experimentally and predicted computationally
in many more systems than we can list here, including
phototransduction, neural and molecular sensing, multistate
receptor systems, immune response, and so on, with
active work persisting to date (see, e.g., Refs. [29, 64, 68-76]
for a very incomplete list of references on the subject).
For example, the best studied adaptive circuit in molecular
biology, the control of chemotaxis of E. coli (see
Chapter 15), largely produces adaptation of the first kind
[77, 75]. Further, a variety of problems in synthetic biology
are due precisely to the mismatch between the typical
protein concentration of the input signal and the response
function that maps this concentration into the rate of
mRNA transcription or protein translation (cf. Ref. [79] and
Chapter 4 in this book). Thus there is an active community
of researchers working on endowing these circuits
with proper adaptive matching abilities of the first kind.
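A small numerical sketch may help fix the meaning of Eq. (19). It assumes a logistic form for the sigmoidal $f$ and arbitrary illustrative values of $k$, $\sigma_\eta$, and the transition width; none of these choices, nor the function names, come from the text.

```python
import math

def dose_response(s, s_half, width=0.2, f_min=0.0, f_max=1.0):
    """Hypothetical logistic f(s); the chapter does not fix a functional form."""
    x = (s - s_half) / (2.0 * width)
    return f_min + (f_max - f_min) * 0.5 * (1.0 + math.tanh(x))

def distinguishable(s1, s2, s_half, k=1.0, sigma_eta=0.01):
    """True if the steady-state responses f(s)/k differ by more than sigma_eta/k."""
    r1 = dose_response(s1, s_half) / k
    r2 = dose_response(s2, s_half) / k
    return abs(r2 - r1) > sigma_eta / k

mu = 5.0                       # assumed mean signal
s1, s2 = mu - 0.05, mu + 0.05  # two typical signals near the mean

# Midpoint far below the mean: the response is saturated near f_max/k.
unadapted = distinguishable(s1, s2, s_half=1.0)
# Midpoint moved to the mean (adaptation of the first kind).
adapted = distinguishable(s1, s2, s_half=mu)
print(unadapted, adapted)
```

With these assumed numbers the two signals are resolved only after the midpoint has been shifted to the mean.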

Consider now the quasi-stationary signal taken from
the distribution with
$\sigma \equiv \left( \langle s(t)^2 \rangle_t - \mu^2 \right)^{1/2} \gg \Delta s$.
Then the response to most of the signals is indistinguishable
from the extremes, and it will be near the midpoint
$\approx (r_{\max} + r_{\min})/2$ if $\sigma \ll \Delta s$. Thus, to use
the full dynamic range of the response, a biological system must
tune the width of the sigmoidal dose-response curve to
$\Delta s \approx \sigma$. We call this gain control, variance
adaptation, or adaptation of the second kind. Experiments show
that a variety of systems exhibit this adaptive behavior as well
[80], especially in the context of neurobiology [11, 81],
and maybe even of evolution [82].
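The effect of the width mismatch can be illustrated numerically. The sketch below assumes Gaussian signals, a logistic dose-response curve whose scale parameter `width` stands in for $\Delta s$, and the entropy of a coarsely binned response as a rough proxy for how much of the dynamic range is used. All names and values are illustrative assumptions.

```python
import math
import random

def sigmoid(s, s_half, width):
    """Logistic curve normalized to the range [0, 1] (an assumed form of f)."""
    return 0.5 * (1.0 + math.tanh((s - s_half) / (2.0 * width)))

def response_entropy(sigma, width, n_samples=20_000, n_bins=10, seed=0):
    """Entropy (bits) of the binned response to Gaussian signals of std sigma."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_samples):
        r = sigmoid(rng.gauss(0.0, sigma), 0.0, width)
        counts[min(int(r * n_bins), n_bins - 1)] += 1
    probs = [c / n_samples for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

saturated = response_entropy(sigma=1.0, width=0.01)   # sigma >> width
flat = response_entropy(sigma=1.0, width=100.0)       # sigma << width
matched = response_entropy(sigma=1.0, width=0.6)      # width comparable to sigma
print(saturated, flat, matched)
```

Only the matched case spreads the responses over most of the bins; in the other two the responses pile up at the extremes or at the midpoint.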

These matching strategies are well known in the signal
processing literature under the name of histogram
equalization. Surprisingly, they are nothing but a special
case of optimizing the mutual information $I[S;R]$, as was
first shown in the context of information processing
in fly photoreceptors [5]. Indeed, for quasi-steady
state responses, when noises are small compared to the
range of the response, the arrangement that optimizes
$I[S;R]$ is the one that produces $P(r) \propto 1/\sigma_{r|s}$. In
particular, when $\sigma_\eta$ is independent of $r$ and $s$, this
means that each $r$ must be used equiprobably, that is,
$f^*(s) = \int_{-\infty}^{s} P(s')\,ds'$. Adaptation of the first and
the second kind follows from these considerations immediately.
In more complex cases, when the noise variance
is not small or not constant, derivation of the optimal
response activation function cannot be done analytically,
but numerical approaches can be used instead. In particular,
in transcriptional regulation of the early Drosophila
embryonic development, the matching between the response
function and the signal probability distribution
has been observed for nonconstant $\sigma_{r|s}$. However,
we caution the reader that, even though adaptation can
have this intimate connection to information maximization,
and it is essentially omnipresent, the number of
systems where the adaptive strategy has been analyzed
quantitatively to show that it results in optimal information
processing is not that large.
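The equiprobable-use condition can be checked with a few lines of code. This is a sketch under stated assumptions: the response is normalized to the range [0, 1], noise is ignored, the signal distribution is taken to be log-normal purely for illustration, and the optimal curve is approximated by the empirical cumulative distribution of a training sample.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return f*(s), the fraction of training samples that do not exceed s."""
    ordered = sorted(samples)
    n = len(ordered)
    def f_star(s):
        return bisect.bisect_right(ordered, s) / n
    return f_star

rng = random.Random(1)
training = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
f_star = empirical_cdf(training)

# Pass fresh signals from the same (assumed) distribution through f*.
responses = [f_star(rng.lognormvariate(0.0, 1.0)) for _ in range(5000)]

n_bins = 10
counts = [0] * n_bins
for r in responses:
    counts[min(int(r * n_bins), n_bins - 1)] += 1
fractions = [c / len(responses) for c in counts]
print(fractions)  # each response bin should be used about equally often
```

The strongly skewed signal distribution is mapped onto an approximately flat distribution of responses, which is the histogram equalization described above.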

We now relax the requirement of quasi-stationarity 
and return to dynamically changing stimuli. We rewrite 
Eq. (17) in the frequency domain, 



$$ r_\omega = \frac{[f(s)]_\omega + \eta_\omega}{k + i\omega}, \tag{20} $$



which shows that the simple first order (or linearized)
kinetics performs low pass filtering of the nonlinearly
transformed signal [11, 19]. As discussed long ago by Wiener
[56], for given temporal correlations of the stimulus and
the noise (which we summarize here for simplicity by
correlation times $\tau_s$ and $\tau_\eta$), there is an optimal
cutoff frequency $k$ that allows one to filter out as much noise
as possible without filtering out the signal. Changing
the parameter $k$ to match the temporal structure of the
problem is called time scale adaptation or adaptation
of the third kind. Just like the first two kinds, time
scale adaptation can also be related to maximization of
the stimulus-response mutual information by means of
a simple observation that minimization of the quadratic
prediction error of the Wiener filter is, under certain
assumptions, equivalent to maximizing information about
the signal, cf. Eq. (12).
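The existence of an optimal cutoff can be seen in a simple simulation. The sketch below makes several assumptions that are not in the text: the signal is an Ornstein-Uhlenbeck process with correlation time `tau_s`, the noise is white and added to the input, the filter is a unit-gain version of the first order kinetics (the estimate plays the role of $k r$), and the quality measure is the mean squared error.

```python
import random

def filtering_error(k, signal, observed, dt):
    """Mean squared error of a first order low pass filter with cutoff k."""
    est, total = 0.0, 0.0
    for s, x in zip(signal, observed):
        est += k * dt * (x - est)   # Euler step of d(est)/dt = k (x - est)
        total += (est - s) ** 2
    return total / len(signal)

rng = random.Random(0)
dt, tau_s, sigma_s, sigma_obs = 0.01, 1.0, 1.0, 3.0   # illustrative values
signal, observed, s = [], [], 0.0
for _ in range(50_000):
    # Ornstein-Uhlenbeck signal with stationary standard deviation sigma_s
    s += -s / tau_s * dt + sigma_s * (2.0 * dt / tau_s) ** 0.5 * rng.gauss(0.0, 1.0)
    signal.append(s)
    observed.append(s + sigma_obs * rng.gauss(0.0, 1.0))

k_grid = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
errors = {k: filtering_error(k, signal, observed, dt) for k in k_grid}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

A very small $k$ averages away the signal together with the noise, while a very large $k$ lets the noise through, so the error is expected to be smallest at an intermediate cutoff that depends on the assumed signal and noise statistics.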



This adaptation strategy is difficult to study experimentally
since (a) detection of variation of the integration
cutoff frequency $k$ potentially requires observing the
adaptation dynamics on very long time scales, and (b)
prediction of the optimal cutoff frequency requires knowing
the temporal correlation properties of signals, which are
far from trivial to measure (see, e.g., Ref. [53] for a review
of the literature on analysis of statistical properties
of natural signals). Nonetheless, experimental systems
as diverse as turtle cones [53], rats in matching foraging
experiments [3], mouse retinal ganglion cells [85], and
barn owls adjusting auditory and visual maps [86] show
adaptation of the filtering cutoff frequency in response to
changes in the relative time scales and/or the variances
of the signal and the noise. In a few rare cases, including
fly self-motion estimation [13] and E. coli chemotaxis
[87] (numerical experiment), it turned out to be possible
to show that the time scale matching not only improves,
but optimizes the information transmission.



The three kinds of adaptation (to the mean, to 
the variance, and to the time scale of change of 
the signal) can all be related to maximization 
of the stimulus-response information. 



Typically one considers adaptation as a phenomenon
different from redundancy reduction, and we have accepted
this view. However, there is a clear relation between
the two mechanisms. For example, adaptation of
the first kind can be viewed as subtracting out the mean
of the signal, stopping its repeated, redundant transmission
and allowing the system to focus on the non-redundant,
changing components of the signal. Like any redundancy
reduction procedure, this may introduce ambiguities: a
perfectly adapting system will respond in the same fashion
to different stimuli, preventing unambiguous identification
of the stimulus based on the instantaneous response.
Knowing statistics of responses on the scale of adaptation
itself may be required to resolve the problem. This
interesting complication has been explored in a few model
systems [13, 85].



C. Mechanisms of Different Adaptations 

The three kinds of adaptation we consider here can 
all be derived from the same principle of optimizing the 
stimulus-response mutual information, and evolution can 
achieve all of them. However, the mechanisms behind 
these adaptations on physiological, non evolutionary time 
scales and their mathematical descriptions can be sub- 
stantially different, as we describe below. 

The adaptation of the first kind has been studied
extensively. On physiological scales, it is typically
implemented using negative feedback loops or incoherent
feedforward loops, as illustrated in Fig. 3. In all of these cases,
the fast activation of the response by the signal is
followed by a delayed suppression mediated by a memory
node. This allows the system to transmit changes in
the signal, and yet to desensitize and return close (and
sometimes perfectly close) to the original state if the same
excitation persists. This response to changes in the signal
earns adaptation of the first kind the name of differentiating
filter. In particular, the feedback loop in E. coli
chemotaxis [29, 77] or yeast signaling [88] can be represented
as the feedback topologies in the figure (see Chapter 15),
and different models of Dictyostelium adaptation
include both feedforward and feedback designs [68, 89].

The different network topologies have different sensi- 
tivities to changes in the internal parameters, different 
tradeoffs between the sensitivity to the stimulus change 
and the quality of adaptation, and so on. However, fun- 
damentally they are similar to each other. This can be 
seen by noting that since the goal of these adaptive
systems is to keep the signal within the small transition
region between the minimum and the maximum activation
of the response, it makes sense to linearize the dynamics 
of the networks near the mean values of the signal and 









FIG. 3: Different network topologies able to perform adap- 
tation to the mean. The nodes are labeled: s - sig- 
nal, r - response, and m - memory. Sharp arrows indi- 
cate activation/excitation and blunt ones stand for deactiva- 
tion/suppression. The thickness of arrows denotes the speed 
of action (faster action for thicker arrows). 



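the corresponding response. Defining $\xi = s - \bar{s}$,
$\zeta = r - \bar{r}$, and $\chi = m - \bar{m}$, one can write, for
example, for the feedback topologies in Fig. 3,

$$ \frac{d\zeta}{dt} = -k_{\zeta\zeta}\zeta + k_{\zeta\xi}\xi - k_{\zeta\chi}\chi + \eta_\zeta, \tag{21} $$

$$ \frac{d\chi}{dt} = k_{\chi\zeta}\zeta - k_{\chi\chi}\chi + \eta_\chi, \tag{22} $$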



where $\eta_\cdot$ are noises, and the coefficients are positive
for the fourth topology, and some of them change their
signs for the third. Doing the usual Fourier transform of
these equations (see Ref. [71] for a very clear, pedagogical
treatment) and expressing $\zeta$ in terms of $\xi$, $\eta_\zeta$,
and $\eta_\chi$, we see that it is only the product
$k_{\chi\zeta} k_{\zeta\chi}$ that matters for the properties of the
filter, Eqs. (21, 22). Hence both the feedback topologies
in Fig. 3 are essentially equivalent in this regime.
Furthermore, as argued in [68, 90], a simple linear
transformation of $\zeta$ and $\chi$ allows one to recast
the incoherent feedforward loops (the first two topologies
in Fig. 3) into a feedback design, again arguing that, at
least in the linear regime, the differences among all of
these organizations are rather small from the mathemat- 
ical point of view. [TUB"] 
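For the reader who wants the missing algebra, the Fourier-domain solution can be sketched as follows. The block assumes that the linearized feedback equations have the form written in its first two lines (response $\zeta$, memory $\chi$, signal $\xi$, noises $\eta_\zeta$ and $\eta_\chi$) and that time derivatives map to $i\omega$, as in Eq. (20).

```latex
% Assumed linearized feedback model, Fourier transformed (d/dt -> i\omega)
\begin{align*}
i\omega\,\zeta_\omega &= -k_{\zeta\zeta}\zeta_\omega + k_{\zeta\xi}\xi_\omega
                         - k_{\zeta\chi}\chi_\omega + \eta_{\zeta,\omega},\\
i\omega\,\chi_\omega  &= k_{\chi\zeta}\zeta_\omega - k_{\chi\chi}\chi_\omega
                         + \eta_{\chi,\omega}.
\end{align*}
% Solve the second equation for the memory variable
\begin{equation*}
\chi_\omega = \frac{k_{\chi\zeta}\zeta_\omega + \eta_{\chi,\omega}}{k_{\chi\chi} + i\omega},
\end{equation*}
% and substitute it into the first
\begin{equation*}
\zeta_\omega =
\frac{k_{\zeta\xi}\,\xi_\omega + \eta_{\zeta,\omega}
      - \dfrac{k_{\zeta\chi}\,\eta_{\chi,\omega}}{k_{\chi\chi} + i\omega}}
     {k_{\zeta\zeta} + i\omega + \dfrac{k_{\zeta\chi}k_{\chi\zeta}}{k_{\chi\chi} + i\omega}} .
\end{equation*}
```

In this form the feedback couplings enter the signal transfer function only through the product $k_{\zeta\chi}k_{\chi\zeta}$; the single factor of $k_{\zeta\chi}$ in front of $\eta_\chi$ can be absorbed by rescaling $\chi$ together with its noise. At $\omega \to 0$ the gain is $k_{\zeta\xi}/(k_{\zeta\zeta} + k_{\zeta\chi}k_{\chi\zeta}/k_{\chi\chi})$, which vanishes only as $k_{\chi\chi} \to 0$, that is, for a memory node that does not leak.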

The reason why we can make so much progress in the 
analysis of adaptation to the mean is that the mean is 
a linear function of the signal, and hence it can be ac- 
counted for in a linear approximation. Different network 
topologies differ in their actuation components (that is, 
how the measured mean is then fed back into changing 
the response generation), but averaging a linear function 
of the signal over a certain time scale is the common 
description of the sensing component of essentially all 
adaptive mechanisms of the first kind. 
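The linear picture is easy to explore numerically. The sketch below integrates a noise-free linear feedback model of the assumed form given in the docstring, in response to a unit step of the signal deviation; the rate constants are arbitrary illustrative values chosen so that the memory node is slow, and the function name is hypothetical.

```python
def step_response(k_zz, k_zx, k_zc, k_cz, k_cc, xi=1.0, dt=0.01, n_steps=10_000):
    """Euler integration of the assumed noise-free linear feedback model.

    d(zeta)/dt = -k_zz*zeta + k_zx*xi - k_zc*chi   (response)
    d(chi)/dt  =  k_cz*zeta - k_cc*chi             (memory)
    """
    zeta, chi, trace = 0.0, 0.0, []
    for _ in range(n_steps):
        d_zeta = -k_zz * zeta + k_zx * xi - k_zc * chi
        d_chi = k_cz * zeta - k_cc * chi
        zeta += dt * d_zeta
        chi += dt * d_chi
        trace.append(zeta)
    return trace

perfect = step_response(1.0, 1.0, 1.0, 0.1, 0.0)   # non-leaky memory
leaky = step_response(1.0, 1.0, 1.0, 0.1, 0.1)     # leaky memory
swapped = step_response(1.0, 1.0, 0.1, 1.0, 0.0)   # same product k_zc * k_cz

peak, final_perfect, final_leaky = max(perfect), perfect[-1], leaky[-1]
max_difference = max(abs(a - b) for a, b in zip(perfect, swapped))
print(peak, final_perfect, final_leaky, max_difference)
```

The response rises quickly and then relaxes as the memory node accumulates a slow average of it. With a non-leaky memory the response returns to baseline, with a leaky one it settles at a finite offset, and exchanging the two feedback couplings at fixed product leaves the response trace unchanged up to rounding.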



Adaptation to the mean can be analyzed lin- 
early, and many different designs become simi- 
lar in this regime. 



Variance and time scale adaptations are fundamentally 
different. While the actuation part for them is not any 
more difficult than for adaptation to the mean, adapting
to the variance requires averaging the square or another
nonlinear function of the signal to sense its current
variance, and estimation of the time scale of the
signal requires estimation of the spectrum or of the
correlation function (both are bilinear averages). Therefore,
maybe it is not surprising that the literature on mathe- 
matical modeling of mechanisms of these types of adap- 
tation is rather scarce. While functional models corre- 
sponding to a bank of filters or estimators of environmen- 
tal parameters operating at different time scales can ac- 
count for most of the experimentally observed data about 
changes in the gain and in the scale of temporal integra- 
tion [3l El [85l El], to our knowledge, these models 
largely have not been related to non-evolutionary, mech- 
anistic processes at molecular and cellular scales that un- 
derlie them. 
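The need for a nonlinear sensing step can be made concrete with a toy example. The sketch assumes white Gaussian signals with the same mean and different standard deviations, and a leaky (exponentially weighted) running average as the sensing element; all names and numbers are illustrative assumptions.

```python
import random

def leaky_average(values, rate):
    """Exponentially weighted running average; returns its final value."""
    avg = 0.0
    for v in values:
        avg += rate * (v - avg)
    return avg

rng = random.Random(0)
n, rate = 20_000, 0.001
quiet = [rng.gauss(0.0, 1.0) for _ in range(n)]   # standard deviation 1
loud = [rng.gauss(0.0, 3.0) for _ in range(n)]    # standard deviation 3

# Averaging the signal itself senses the mean and is blind to the variance.
linear_quiet = leaky_average(quiet, rate)
linear_loud = leaky_average(loud, rate)

# Averaging the square of the signal tracks the variance.
square_quiet = leaky_average([s * s for s in quiet], rate)
square_loud = leaky_average([s * s for s in loud], rate)
print(linear_quiet, linear_loud, square_quiet, square_loud)
```

Both linear averages stay near zero, while the averages of the squared signal differ by roughly the ratio of the two variances.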

The largest inroads in this direction have been achieved
when integration of a nonlinear function of a signal
results in an adaptation response that depends not just
on the mean, but also on higher order cumulants of the
signal, effectively mixing different kinds of adaptation
together. This may be desirable in the cases of
photoreception [71] and chemosensing [80], where the signal
mean is inalienably connected to the signal or the noise
variances (e.g., the standard deviation of brightness of a
visual scene scales linearly with the background illumination,
while the noise in the molecular concentration is
proportional to the square root of the latter). Similarly,
mixing means and variances allows the budding yeast to
respond to fractional rather than additive changes of a
pheromone concentration [52]. In other situations, like
adaptation by a receptor with state-dependent inactivation
properties, similar mixing of the mean signal with
its temporal correlation properties to form an adaptive
response may not serve an obvious purpose [73].

We know very little about physiological mecha- 
nisms of adaptation of the second and the third 
kind. 

In a similar manner, integration of a strongly nonlin- 
ear function of a signal may allow a system to respond to 
signals in a gain-insensitive fashion, effectively adapting 
to the variance without a true adaptation. Specifically, 
one can threshold the stimulus around its mean value and 
then integrate it to count how long it has remained positive.
For any temporally correlated stimulus, the time
since the last mean-value crossing is correlated with the
instantaneous stimulus value (it takes a long time to reach
high stimulus values), and this correlation is independent
of the gain. It has been argued that adaptation to
the variance in fly motion estimation can be explained at 
least in part by this non-adaptive process [81] . Similar 
mechanisms are easy to implement in molecular signaling 
systems as well [9"5] . 
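The threshold-and-count idea can be sketched in a few lines. The stimulus is assumed here to be an Ornstein-Uhlenbeck process with zero mean, the clock counts the time spent above the mean and resets whenever the stimulus drops below it, and the gain is an arbitrary positive factor; these choices and the function names are illustrative assumptions.

```python
import random

def time_above_mean(signal, mean, dt):
    """Clock that counts how long the signal has stayed above its mean."""
    clock, out = 0.0, []
    for s in signal:
        clock = clock + dt if s > mean else 0.0
        out.append(clock)
    return out

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
dt, tau = 0.01, 1.0
signal, s = [], 0.0
for _ in range(50_000):
    s += -s / tau * dt + (2.0 * dt / tau) ** 0.5 * rng.gauss(0.0, 1.0)
    signal.append(s)

clock_low = time_above_mean(signal, 0.0, dt)
clock_high = time_above_mean([10.0 * v for v in signal], 0.0, dt)  # gain of 10
gain_invariant = clock_low == clock_high

# Correlation between the clock and the stimulus while it is above the mean
pairs = [(c, v) for c, v in zip(clock_low, signal) if v > 0.0]
correlation = pearson([c for c, _ in pairs], [v for _, v in pairs])
print(gain_invariant, correlation)
```

The clock readout is identical for the two gains because it depends only on the sign of the stimulus relative to its mean, and it is positively correlated with the instantaneous stimulus value.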

IV. WHAT'S NEXT? 

It is clear that information theory has an important
role in biology. It is a mathematically correct
construction for analysis of signal processing systems. It 



provides a general framework to recast adaptive processes 
on scales from evolutionary to physiological in terms of a 
(constrained) optimization problem. Sometimes it even 
makes (correct!) predictions about responses of living 
systems following exposure to various signals. 

So, what's next for information theory in the study of 
signal processing in living systems? 

The first and most important problem that still
remains to be solved is that many of the stories we mentioned
above are incomplete. Since we never know for
sure which specific aspect of the world, $e(t)$, an organism
cares about, and the statistics of signals are hard to
measure in the real world, an adaptation that seems to
optimize $I[S;R]$ may be an artifact of our choice of $S$ and
of assumptions about $P(s)$, but not a consequence of the
quest for optimality by an organism. For example, the
time scale of filtering in E. coli chemotaxis [87] may be
driven by the information optimization, or it may be a
function of very different pressures. Similarly, a few
standard deviations mismatch between the cumulative
distribution of light intensities and a photoreceptor response
curve in fly vision [5] can be a sign of an imperfect
experiment, or it can mean that we simply got (almost)
lucky, and the two curves nearly matched by chance. It
is difficult to draw conclusions based on one data point!

Therefore, to complete these and similar stories, the
information arguments must be used to make predictions
about adaptations in novel environments, and such
adaptations must be observed experimentally. This has been
done in some contexts in neuroscience [2, 11, 13, 94], but
molecular sensing lags behind. This is largely because
evolutionary adaptation, too slow to observe, is expected
to play a major role here, and because careful control
of dynamic environments, or characterization of statistical
properties of naturally occurring environments [83],
needed for such experiments is not easy. New experimental
techniques, such as microfluidics [95] and artificially
sped up evolution [96], are about to solve these problems,
opening the proverbial doors wide for a new class
of experiments.

The second important research direction, which will
require combined progress in experimental techniques and
mathematical foundations, is likely going to be the
return of dynamics. This has had a revolutionary effect in
neuroscience [10], revealing responses unimaginable for
quasi-steady-state stimuli, and dynamical stimulation is
starting to take off in molecular systems as well [64, 97].
How good are living systems at filtering out those aspects
of their time-dependent signals that are not predictive
and are, therefore, of no use? What is the evolutionary
growth bound when signals change in a continuous,
predictive fashion? None of these questions has been
touched yet, whether theoretically or experimentally.

Finally, we need to start building mechanistic models
of adaptation in living systems that are more complex than
a simple subtraction of the mean. How are the amazing
adaptive behaviors of the second and the third kind
achieved in practice on physiological scales? Does it even
make sense to distinguish the three different adaptations,
or can some molecular or neural circuits achieve them
all? How many and which parameters of the signal do
neural and molecular circuits estimate and how? Some
of these questions may be answered if one is capable of
probing the subjects with high frequency, controlled
signals [55], and the recent technological advances will be a
game changer as well.

Overall, studying biological information processing
over the next ten years will be an exciting pastime!